Chapter 3 Exploratory analysis
This chapter lists and describes every variable (data field) that appears in the dataset.
Variables related to the trade process:
- control_number: It represents a unique individual shipment processed by the USFWS.
- quantity: It represents the numeric quantity of the wildlife product
- unit: It represents the unit for the numeric quantity
- import_export: It represents whether the shipment is an (I)mport or (E)xport
- action: Action taken by USFWS on import ((C)leared/(R)efused)
- shipment_date: Full date when shipment arrived
- shipment_year: Year when the shipment arrived (derived from “shipment_date”)
- disposition: Fate of the import
- disposition_date: Full date when disposition occurred
- disposition_year: Year when disposition occurred (derived from “disposition_date”)
Variables related to the countries:
- country_origin: It represents the code for the country of origin of the wildlife product
- country_imp_exp: It represents the code for the country to/from which the wildlife product is shipped
- port: It represents the port or region of shipment entry
- us_co: It represents the US party of the shipment
- foreign_co: It represents the foreign party of the shipment
Variables related to the product:
- description: It represents the type/form of the wildlife product
- value: It represents the reported value of the wildlife product in US dollars
- purpose: It represents the reason the wildlife product is being imported
- source: It represents the type of source within the origin country (e.g., wild, bred)
- species_code: It represents the USFWS code for the wildlife product
- taxa: It represents the USFWS-derived broad taxonomic categorization
- class: It represents the EHA-derived class-level taxonomic designation
- genus: It represents the Genus (or higher-level taxonomic name) of the wildlife product
- species: It represents species of the wildlife product
- subspecies: It represents subspecies of the wildlife product
- specific_name: It represents a specific common name for the wildlife product
- generic_name: It represents a general common name for the wildlife product
3.1 Variables related to the trade process
3.1.1 control_number
It represents a unique individual shipment processed by the USFWS.
- There are 2,088,676 unique shipments
- Different wildlife products contained within the same shipment may be represented in the LEMIS data by multiple data rows, all of which share a common ‘control_number’.
- All rows of data sharing the same ‘control_number’ share the same country of shipment and shipment date.
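The properties listed above can be checked directly. The following is a minimal sketch on a toy data frame (the `toy` object and its values are invented for illustration; only the column names come from the dataset): it counts unique shipments and looks for shipments whose rows disagree on date or country.

```r
library(dplyr)

# Toy stand-in for the LEMIS data: shipment 1001 contains two products
toy <- tibble(
  control_number  = c(1001, 1001, 1002, 1003),
  shipment_date   = as.Date(c("2010-01-05", "2010-01-05",
                              "2011-03-10", "2012-07-21")),
  country_imp_exp = c("CN", "CN", "MX", "CA")
)

# Number of unique shipments (fewer than rows when a shipment has several products)
n_shipments <- n_distinct(toy$control_number)

# Shipments whose rows disagree on date or country; expected to be empty
inconsistent <- toy %>%
  group_by(control_number) %>%
  summarise(n_dates     = n_distinct(shipment_date),
            n_countries = n_distinct(country_imp_exp)) %>%
  filter(n_dates > 1 | n_countries > 1)
```

On the real data, a non-empty `inconsistent` table would indicate rows that break the stated rule and deserve a closer look.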
3.1.2 quantity and unit
Quantity represents the numeric quantity of the wildlife product, while unit represents the unit of measure for the numeric quantity.
- There are 13 types of units.
- 94.64% of the data is recorded in the unit “Number”
# Renaming levels
data <- data %>%
mutate(unit = recode(unit, "NO" = "Number", "KG" = "Kilograms",
"GM"= "Grams", "M2"="Square_meters",
"LT"= "Liters", "ML"= "Milliliters",
"MT"="Meters", "CM" = "Centimeters",
"C3"= "Cubic_centimeters", "C2"="Square_centimeters",
"MG"="Milligrams", "M3"="Cubic_meters"))
units <-data %>% group_by(unit) %>%
summarise(total = n(),percentage=n()/nrow(data)) %>%
drop_na(unit) %>%
arrange(desc(percentage))
DT::datatable(units,
caption = htmltools::tags$caption(
style='caption-side: bottom; text-align: center;','Table 1: ',
htmltools::em('Units of measure'
))) %>%
formatRound('total',1) %>%
formatPercentage('percentage',2)
3.1.3 import_export
It represents whether the shipment is an (I)mport or (E)xport. In this dataset, 100% of the records are imports.
# Renaming levels
data <- data %>%
mutate(import_export = recode(import_export, "I" = "Import"))
imports <-data %>% group_by(import_export) %>%
summarise(total = n(), percentage=n()/nrow(data)) %>%
arrange(desc(percentage))
DT::datatable(imports,
caption = htmltools::tags$caption(
style='caption-side: bottom; text-align: center;','Table 2: ',
htmltools::em('Shipments: Imports and exports'
))) %>%
formatRound('total',1) %>%
formatPercentage('percentage',2)
imports %>% mutate(percentage=percentage*100) %>%
plot_ly(x=~reorder(import_export, desc(percentage)), y=~percentage, color=~import_export) %>%
add_bars() %>%
layout(title = "<b>Shipments: Imports and exports</b>",
xaxis= list(title= "<b>Imports and exports</b>" ,tickangle=-65),
yaxis = list(title = "<b>Percentage</b>"))
3.1.4 action
Action taken by the USFWS on import ((C)leared/(R)efused)
- 98.28% of imports are cleared; only 1.73% are refused
# Renaming levels
data <- data %>%
mutate(action = recode(action, "C" = "Cleared", "R"= "Refused"))
action <-data %>% group_by(action) %>%
summarise(total = n(), percentage=n()/nrow(data)) %>%
drop_na(action) %>%
arrange(desc(percentage))
DT::datatable(action,
caption = htmltools::tags$caption(
style='caption-side: bottom; text-align: center;','Table 3: ',
htmltools::em('Action taken by the USFWS on imports'
))) %>%
formatRound('total',1) %>%
formatPercentage('percentage',2)
action %>% mutate(percentage=percentage*100) %>%
plot_ly(x=~reorder(action, desc(percentage)), y=~percentage, color=~action) %>%
add_bars() %>%
layout(title = "<b>Action taken by the USFWS on imports</b>",
xaxis= list(title= "<b>Actions</b>" ,tickangle=-65),
yaxis = list(title = "<b>Percentage</b>"))
3.1.5 disposition
It represents the fate of the import
- There are 5 categories: C, S, A, R and non-standard value
- The C category represents 98.3% of data
disposition <-data %>% group_by(disposition) %>%
summarise(total = n(), percentage=n()/nrow(data)) %>%
drop_na(disposition) %>%
arrange(desc(percentage))
DT::datatable(disposition,
caption = htmltools::tags$caption(
style='caption-side: bottom; text-align: center;','Table 4: ',
htmltools::em('Fate of the import'
))) %>%
formatRound('total',1) %>%
formatPercentage('percentage',2)
disposition %>% mutate(percentage=percentage*100) %>%
plot_ly(x=~reorder(disposition, desc(percentage)), y=~percentage, color=~disposition) %>%
add_bars() %>%
layout(title = "<b>Fate of the import</b>",
xaxis= list(title= "<b>Dispositions</b>" ,tickangle=-65),
yaxis = list(title = "<b>Percentage</b>"))
3.1.6 shipment_date and disposition_date
shipment_date represents the date when the shipment arrived, while disposition_date represents the date when the disposition occurred.
- 54% of dispositions took place within a month of the shipment date (most of them within a week)
- While ‘shipment_date’ entries fell completely within the time period of 2000–2014, ‘disposition_date’ ranged more widely
- Users should be wary of any disposition date values that precede the associated shipment date, as we are unaware of how this could represent an accurate accounting of the product disposition process. However, for many potential analyses, differences in the date fields may not be a significant cause for concern because ‘shipment_date’ alone provides a sound index for those interested in temporal trends in wildlife trade
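The caution about disposition dates that precede the shipment date can be turned into an explicit flag. This is a minimal sketch on invented values, assuming both columns are of class `Date`; the names `lag_days` and `disposition_before_shipment` are hypothetical and not part of the dataset.

```r
# Toy data; the real columns are assumed to be of class Date
toy <- data.frame(
  shipment_date    = as.Date(c("2005-06-01", "2005-06-01",
                               "2010-02-15", "2013-11-30")),
  disposition_date = as.Date(c("2005-06-03", "2005-05-20",
                               "2010-04-01", NA))
)

# Lag in days between shipment and disposition (NA when a date is missing)
toy$lag_days <- as.numeric(toy$disposition_date - toy$shipment_date)

# TRUE when the disposition is recorded before the shipment arrived
toy$disposition_before_shipment <- !is.na(toy$lag_days) & toy$lag_days < 0

# Share of rows with a known lag that fall within 30 days of arrival
# (rows with a missing date are excluded from the denominator here)
share_within_month <- mean(toy$lag_days >= 0 & toy$lag_days <= 30, na.rm = TRUE)
```

Flagged rows can then be excluded or inspected separately, depending on whether an analysis relies on `disposition_date` at all.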
days<- data %>%
mutate(days = as.factor(as.numeric(disposition_date - shipment_date))) %>%
group_by(days) %>%
summarise(total = n(), percentage=n()/nrow(data)) %>%
filter(days %in% as.character(0:30)) # keep lags of 0 to 30 days
DT::datatable(days,
caption = htmltools::tags$caption(
style='caption-side: bottom; text-align: center;','Table 5: ',
htmltools::em('Days between shipment and disposition date'
))) %>%
formatRound('total',1) %>%
formatPercentage('percentage',2)
3.2 Variables related to the countries
3.2.1 country_origin
It represents the code for the country of origin of the wildlife product
- There are 252 countries of origin
- The top 15 represents 74.5% of data
- The top 50 represents 94.4% of data
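Figures such as "the top 15 represents 74.5% of data" are cumulative shares of the most frequent values. A minimal sketch of that calculation follows; `top_n_share` is a hypothetical helper and the toy vector is invented, so it illustrates the idea rather than reproducing the reported numbers.

```r
# Hypothetical helper: share of all rows covered by the n most frequent values of x
top_n_share <- function(x, n) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA values
  sum(head(counts, n)) / length(x)             # denominator keeps NA rows, like nrow(data)
}

# Toy country-of-origin column: 10 rows, one of them missing
toy_origin <- c(rep("China", 5), rep("Mexico", 3), rep("Canada", 1), NA)

top_n_share(toy_origin, 2)  # China and Mexico together cover 8 of 10 rows
```

The same helper could be applied to any categorical column, for example `top_n_share(data$country_origin, 15)`, assuming `data` is loaded as in this chapter.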
# Renaming levels
data$country_origin<- countrycode(data$country_origin, "iso2c",
"country.name", nomatch = NULL)
data <- data %>%
mutate(country_origin = recode(country_origin, "AN" = "Netherlands Antilles",
"XX" = "Unknown","ZZ" = "High Seas"))
country_origin <-data %>% group_by(country_origin) %>%
summarise(total = n(), percentage=n()/nrow(data)) %>%
drop_na(country_origin) %>%
arrange(desc(percentage))
DT::datatable(country_origin,
caption = htmltools::tags$caption(
style='caption-side: bottom; text-align: center;','Table 6: ',
htmltools::em('Country of origin of the wildlife product'
))) %>%
formatRound('total',1) %>%
formatPercentage('percentage',2)
country_origin %>% mutate(percentage=percentage*100) %>%
top_n(30, percentage) %>%
plot_ly(x=~reorder(country_origin, desc(percentage)), y=~percentage,
color=~country_origin) %>%
add_bars() %>%
layout(title = "<b>Country of origin of the wildlife product</b>",
xaxis= list(title= "<b>Country of origin</b>" ,tickangle=-65),
yaxis = list(title = "<b>Percentage</b>"))
3.2.2 country_imp_exp
It represents the code for the country to/from which the wildlife product is shipped
- There are 257 countries to/from which the product is shipped
- The top 15 represents 75.9% of data
- The top 50 represents 95.9% of data
# Renaming levels
data$country_imp_exp<- countrycode(data$country_imp_exp, "iso2c",
"country.name", nomatch = NULL)
data <- data %>%
mutate(country_imp_exp = recode(country_imp_exp, "AN" = "Netherlands Antilles",
"XX" = "Unknown","ZZ" = "High Seas"))
country_imp_exp <-data %>% group_by(country_imp_exp) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(country_imp_exp) %>%
arrange(desc(total))
DT::datatable(country_imp_exp) %>%
formatPercentage('total',2)
country_imp_exp %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(country_imp_exp, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="country_imp_exp") 
The most frequent countries appear on both sides (as country of origin and as country to/from which the wildlife product is shipped)
country_origin <-data %>% group_by(country_origin) %>%
summarise(total=n()/nrow(data) *100) %>%
drop_na(country_origin) %>%
arrange(desc(total)) %>% top_n(4, total) %>%
rename(country = country_origin)
country_imp_exp <-data %>% group_by(country_imp_exp) %>%
summarise(total=n()/nrow(data) *100) %>%
drop_na(country_imp_exp) %>%
arrange(desc(total)) %>% top_n(4, total) %>%
rename(country = country_imp_exp)
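# Stack both summaries; the "source" column records which table each row came from
# (bind_rows() is used here in place of combine(), which presumably came from gdata)
countries <- bind_rows(country_origin = country_origin,
country_imp_exp = country_imp_exp, .id = "source")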
ggplot(data=countries, aes(x=reorder(country, -total), y=total, fill=source)) +
geom_bar(stat="identity", position=position_dodge())+
geom_text(aes(label=round(total,2)),vjust=1.6,position = position_dodge(0.9), size=3.5)+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="country") 
3.2.3 port
It represents the port or region of shipment entry
- There are 73 ports
- The top 15 represents 83.9% of data
# Renaming levels
data <- data %>%
mutate(port = recode(port, "LA" = "Los Angeles, CA","NY" = "New York, NY",
"MI" = "Miami, FL", "NW" = "Newark, NJ",
"SF" = "San Francisco, CA", "CH" = "Chicago, IL",
"DF" = "Dallas/Fort Worth, TX",
"LO" = "Louisville, KY", "AN" = "Anchorage, AK",
"HN"= "Houston, TX", "ME" = "Memphis, TN",
"AT" = "Atlanta, GA","HA" = "Honolulu, HI",
"SE" = "Seattle, WA", "PB" = "Pembina, ND"))
port <-data %>% group_by(port) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(port) %>%
arrange(desc(total))
DT::datatable(port) %>%
formatPercentage('total',2)
port %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(port, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="port") 
3.2.4 us_co
It represents the US party of the shipment
- We have excluded the “EXEMPTIONS 6 AND 7(C)” from the analysis.
- There are 126,052 US parties
- The top 15 just represents 10.3% of data
- The top 50 just represents 19% of data
us_co <-data %>%
filter(us_co != "EXEMPTIONS 6 AND 7(C)") %>%
group_by(us_co) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(us_co) %>%
arrange(desc(total))
DT::datatable(us_co) %>%
formatPercentage('total',3)
us_co %>%
mutate(total=total*100) %>%
top_n(15, total)%>%
ggplot(aes(x=reorder(us_co, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="us_co") 
We want to analyze which US corporations import the most, so we have grouped the major corporations based on their company names.
We’ve grouped most of the companies that each represent at least 0.15% of the data; together they cover approximately 1,000,000 observations.
data$corporation<- # Fashion/Luxury/Design products
ifelse(grepl("prada", data$us_co, ignore.case = TRUE), "Prada",
ifelse(grepl("ralph lauren", data$us_co, ignore.case = TRUE),
"Ralph Lauren",
ifelse(grepl("LOUIS VUITTON", data$us_co, ignore.case = TRUE),
"Louis Vuitton",
ifelse(grepl("MONCLER", data$us_co, ignore.case = TRUE),
"Moncler",
ifelse(grepl("BOTTEGA VENETA", data$us_co, ignore.case = TRUE),
"Bottega Veneta",
ifelse(grepl("RICHEMONT", data$us_co, ignore.case = TRUE),
"Richemont",
ifelse(grepl("FENDI", data$us_co, ignore.case = TRUE),
"Fendi",
ifelse(grepl("HERMES", data$us_co, ignore.case = TRUE),
"Hermès",
ifelse(grepl("GUCCI", data$us_co, ignore.case = TRUE),
"Gucci",
ifelse(grepl("Beeline", data$us_co, ignore.case = TRUE),
"Beeline Group",
ifelse(grepl("fossil partner", data$us_co, ignore.case = TRUE),
"Fossil Partners, L.P.",
ifelse(grepl("dfs", data$us_co, ignore.case = TRUE),
"DFS Group",
ifelse(grepl("gluck", data$us_co, ignore.case = TRUE),
"E. Gluck Corporation",
ifelse(grepl("ferragamo", data$us_co, ignore.case = TRUE),
"Salvatore Ferragamo",
ifelse(grepl("jacadi", data$us_co, ignore.case = TRUE),
"Jacadi",
ifelse(grepl("bomac", data$us_co, ignore.case = TRUE),
"Bomac International Corp",
ifelse(data$us_co %in% c("PIER I IMPORTS, INC.",
"PIER 1 IMPORTS, INC. "),
"Pier 1",
# Museums
ifelse(grepl("museum", data$us_co, ignore.case = TRUE),
"Museums",
ifelse(grepl("smithsonian", data$us_co, ignore.case = TRUE),
"Museums",
# Animals or animal products providers
ifelse(grepl("SEA DWELLING", data$us_co, ignore.case = TRUE),
"Sea Dwelling creatures",
ifelse(grepl("HIPPOCAMPE", data$us_co, ignore.case = TRUE),
"Hippocampe USA",
ifelse(grepl("AQUA-NAUTIC", data$us_co, ignore.case = TRUE),
"Aqua Nautic Specialist",
ifelse(grepl("UNDERWATER WORLD", data$us_co, ignore.case = TRUE),
"Underwater World",
ifelse(grepl("GOLDEN INA", data$us_co, ignore.case = TRUE),
"Golden Ina",
ifelse(grepl("QUALITY MARINE", data$us_co, ignore.case = TRUE),
"Quality Marine",
ifelse(grepl("Arsian", data$us_co, ignore.case = TRUE),
"Arsian Imports",
ifelse(grepl("aquarium arts", data$us_co, ignore.case = TRUE),
"Aquarium Arts",
ifelse(grepl("WALT SMITH", data$us_co, ignore.case = TRUE),
"Walt Smith International",
ifelse(grepl("all seas fisheries", data$us_co, ignore.case = TRUE),
"Allseas Fisheries",
ifelse(grepl("sun pet ltd", data$us_co, ignore.case = TRUE),
"Sun Pet LTD",
ifelse(grepl("pacific aqua farms", data$us_co, ignore.case = TRUE),
"Pacific Aquafarms",
ifelse(grepl("INTINENTAL", data$us_co, ignore.case = TRUE),
"Intinental Pri",
ifelse(grepl("AQUACO", data$us_co, ignore.case = TRUE),
"Aquaco",
ifelse(grepl("all seas marine", data$us_co, ignore.case = TRUE),
"Allseas Marine",
ifelse(grepl("transship discounts", data$us_co, ignore.case = TRUE),
"Transship Discounts LTD",
ifelse(grepl("SEGREST FARMS", data$us_co, ignore.case = TRUE),
"Segrest Farms",
ifelse(grepl("fish head", data$us_co, ignore.case = TRUE),
"Fish Heads Inc",
ifelse(grepl("holiday coral", data$us_co, ignore.case = TRUE),
"Holiday Coral Inc",
ifelse(grepl("pacific island imp", data$us_co, ignore.case = TRUE),
"Pacific Island Imports",
ifelse(grepl("PAN OCEAN AQUARIUM", data$us_co, ignore.case = TRUE),
"Pan Ocean Aquarium, Inc",
ifelse(grepl("SALTWATER INC.", data$us_co, ignore.case = TRUE),
"Saltwater Inc",
ifelse(grepl("saltwaterfish", data$us_co, ignore.case = TRUE),
"Saltwaterfish",
ifelse(grepl("golden sea int", data$us_co, ignore.case = TRUE),
"Golden Sea Inc",
ifelse(grepl("strictly reptiles", data$us_co, ignore.case = TRUE),
"Strictly Reptiles Inc",
ifelse(grepl("emark tropical", data$us_co, ignore.case = TRUE),
"Emark Tropical Imports, Inc",
ifelse(data$us_co %in% c("DOLPHIN INTERNATIONAL", "DOLPHIN INT'L",
"DOLPHIN INTERNAITONAL"),
"Dolphin International",
ifelse(data$us_co %in% c("a & m aquatics", "A&M AQUATICS"),
"A&M Aquatics",
ifelse(data$us_co %in% c("LPS LLC", "LPS, LLC", "LPS"),
"LPS LLC",
ifelse(data$us_co %in% c("APET, INC", "APET INC"),
"Apet Inc",
"Other")))))))))))))))))))))))))))))))))))))))))))))))))
data_corporations <- data %>%
mutate(corporation = as.factor(corporation)) %>%
filter(corporation!="Other") # 1,004,219
# Let's classify these corporations
corp_fashion<- c("Beeline Group", "Bomac International Corp", "Bottega Veneta",
"DFS Group","E. Gluck Corporation", "Fendi",
"Fossil Partners, L.P.", "Gucci", "Hermès", "Jacadi",
"Louis Vuitton","Moncler", "Pier 1","Prada", "Ralph Lauren",
"Richemont", "Salvatore Ferragamo")
corp_animalproviders<-c("A&M Aquatics", "Aqua Nautic Specialist", "Aquaco",
"Aquarium Arts", "Allseas Fisheries", "Allseas Marine",
"Apet Inc", "Arsian Imports", "Dolphin International",
"Emark Tropical Imports, Inc", "Fish Heads Inc",
"Golden Ina", "Golden Sea Inc", "Hippocampe USA",
"Holiday Coral Inc", "Intinental Pri", "LPS LLC",
"Pacific Aquafarms", "Pacific Island Imports",
"Pan Ocean Aquarium, Inc", "Quality Marine",
"Saltwater Inc", "Saltwaterfish", "Sea Dwelling creatures",
"Segrest Farms", "Strictly Reptiles Inc", "Sun Pet LTD",
"Transship Discounts LTD", "Underwater World",
"Walt Smith International")
data$corp_classif<- ifelse(data$corporation %in% corp_fashion, "Fashion/Luxury/Design",
ifelse(data$corporation %in% corp_animalproviders,
"Animal/animal prod. providers",
ifelse(data$corporation %in% "Museums", "Museums", "Others")))
data$corporation<-as.factor(data$corporation)
data$corp_classif<- as.factor(data$corp_classif)
# Let's summarize and graph these corporations
data_corporations<- data %>%
group_by(corp_classif, corporation) %>%
dplyr::summarise(total=n(), percentage=n()/nrow(data)) %>%
arrange(desc(total))
DT::datatable(data_corporations) %>%
formatPercentage('percentage',3) %>%
formatCurrency('total',currency = "", interval = 3, mark = ",")
data_corporations %>%
mutate(percentage=percentage*100) %>%
filter(corp_classif != "Others") %>%
ungroup() %>%
top_n(15, percentage) %>%
ggplot(aes(x=reorder(corporation, -percentage), y=percentage,fill=corp_classif)) +
geom_bar(stat="identity") +
geom_text(aes(label=round(percentage,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="US corporations") 
## Graph by classification
data_corporations %>%
mutate(percentage=percentage*100) %>%
filter(corp_classif == "Fashion/Luxury/Design") %>%
ggplot(aes(x=reorder(corporation, -percentage), y=percentage)) +
geom_bar(stat="identity", fill="darkgreen") +
geom_text(aes(label=round(percentage,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="Fashion/Luxury/Design Corporations") 
data_corporations %>%
mutate(percentage=percentage*100) %>%
filter(corp_classif == "Animal/animal prod. providers") %>%
top_n(15, percentage) %>%
ggplot(aes(x=reorder(corporation, -percentage), y=percentage,fill=corp_classif)) +
geom_bar(stat="identity") +
geom_text(aes(label=round(percentage,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="Animal/animal prod. providers Corporations") 
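The nested `ifelse()` chain used in this section to build `corporation` is hard to extend. As an alternative, the same "first matching pattern wins" logic can be sketched with a lookup table. This is an illustration under assumptions, not the method used in this chapter: `corp_patterns` holds only a few of the patterns, `label_corporation` is a hypothetical helper, and the company names are invented.

```r
# Pattern -> label pairs; order matters because the first match wins
corp_patterns <- c(
  "prada"        = "Prada",
  "ralph lauren" = "Ralph Lauren",
  "museum"       = "Museums",
  "smithsonian"  = "Museums"
)

# Hypothetical helper: label each name with the first matching pattern, else "Other"
label_corporation <- function(us_co, patterns) {
  label <- rep("Other", length(us_co))
  # Walk the patterns in reverse so that earlier patterns overwrite later ones
  for (i in rev(seq_along(patterns))) {
    hit <- grepl(names(patterns)[i], us_co, ignore.case = TRUE)  # FALSE for NA
    label[hit] <- patterns[[i]]
  }
  label
}

toy_us_co <- c("PRADA USA CORP", "Ralph Lauren Corporation",
               "FIELD MUSEUM", NA, "ACME PETS")
label_corporation(toy_us_co, corp_patterns)
```

Adding a company then means adding one line to the table. The exact-match cases in the original chain (those using `%in%`) would need separate handling, for example anchored patterns.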
3.2.5 foreign_co
It represents the foreign party of the shipment
- There are 237,994 foreign parties
- The top 15 just represents 5.4% of data
- The top 50 just represents 12.4% of data
foreign_co <-data %>% group_by(foreign_co) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(foreign_co) %>%
arrange(desc(total)) %>% top_n(15, total)
DT::datatable(foreign_co) %>%
formatPercentage('total',3)
foreign_co %>%
mutate(total=total*100) %>%
top_n(15, total)%>%
ggplot(aes(x=reorder(foreign_co, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="foreign_co") 
3.3 Variables related to the product
3.3.1 description
It represents the type/form of the wildlife product
- There are 88 types/forms
- Most of them are “Live specimens” (29.1%), followed by trophies (16.83%) and “Shell products” (9.49%).
- The top 15 represents 94.19% of data
# Dropping unused levels
data$description <- droplevels(data$description)
# Renaming levels
data <- data %>%
mutate(description = recode(description,
"LIV" = "Live specimen", "TRO" = "Trophy",
"SPR" = "Shell product", "LPS" = "Leather product",
"JWL" = "Jewelry", "GAR" = "Garment",
"SHO" = "Shoe", "TRI" = "Trim", "MEA" = "Meat",
"SPE" = "Specimen", "SKI" = "Skin",
"BOD" = "Dead animal","LPL" = "Leather product",
"SHE" = "Shell", "FEA"= "Feather",
"HOR" = "Horn", "IVC" ="Ivory carving",
"IVP" = "Ivory piece", "UNS" = "Unspecified"))
description <-data %>% group_by(description) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(description) %>%
arrange(desc(total))
DT::datatable(description) %>%
formatPercentage('total',2)
description %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(description, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="description") 
3.3.2 value
It represents the reported value of the wildlife product in US dollars.
summary(data$value)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 30 187 4805 1260 118000000 1868566
# Data between the minimum value and the 3rd quartile
ggplot(data, aes(x=value))+
geom_histogram(color="darkblue", fill="lightblue", bins=30) +
xlim(c(0, 1260))
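The summary above shows a strongly right-skewed distribution (median 187, mean 4805), which is why the histogram is restricted to values up to the third quartile. One alternative view, sketched here on invented values rather than the real column, is to look at the values on a log scale so that the long tail stays visible. The `+ 1` offset is an assumption made to keep zero-valued records.

```r
# Toy values spanning several orders of magnitude, with a missing entry
toy_value <- c(0, 30, 187, 1260, 50000, 118000000, NA)

# log10(value + 1) keeps zero-valued records and compresses the long right tail
log_value <- log10(toy_value + 1)
summary(log_value)

# Share of non-missing records above the cut-off used for the histogram
share_above_cutoff <- mean(toy_value > 1260, na.rm = TRUE)
```

On the real data, `share_above_cutoff` would be close to 25% by construction, since 1260 is the third quartile.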
3.3.3 purpose
It represents the reason the wildlife product is being imported
- There are 13 purpose levels
- Commercial (74.6%) and “hunting trophies” (17.81%) purposes represent 92.41% of the data.
# Renaming levels
data <- data %>%
mutate(purpose = recode(purpose,
"B" = "Breeding in captivity or artificial propagation",
"E" = "Educational", "G" = "Botanical Gardens",
"H" = "Hunting trophies", "M" = "Biomedical research",
"P" = "Personal", "Q" = "Circuses/traveling exhibitions",
"S" = "Scientific", "T" = "Commercial",
"Y" = "Reintroduction/introduction into the wild",
"Z" = "Zoos"))
purpose <-data %>% group_by(purpose) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(purpose) %>%
arrange(desc(total))
DT::datatable(purpose) %>%
formatPercentage('total',2)
purpose %>%
mutate(total=total*100) %>%
ggplot(aes(x=reorder(purpose, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="purpose") 
3.3.4 source
It represents the type of source within the origin country (e.g., wild, bred)
- There are 10 source levels
- Specimens taken from the wild (77.9%) and animals bred in captivity (14.8%) sources represent 92.7% of the data.
# Renaming levels
data <- data %>%
mutate(source = recode(source,
"W" = "Specimens taken from the wild",
"C" = "Animals bred in captivity",
"R" = "Specimens originating from a ranching operation",
"F" = "Animals born in captivity but not captive-bred",
"U" = "Source unknown",
"D" = "Commercially bred or propagated in CITES",
"I" = "Confiscated or seized specimens",
"A" = "Plants artificially propagated, parts and derivatives"))
source <-data %>% group_by(source) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(source) %>%
arrange(desc(total))
DT::datatable(source) %>%
formatPercentage('total',2)
source %>%
mutate(total=total*100) %>%
ggplot(aes(x=reorder(source, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="source") 
3.3.5 species_code
It represents the USFWS code for the wildlife product
- There are 15,322 unique species codes
- The top 15 represents 30% of data
- The top 50 represents 50.4% of data
# Dropping unused levels
data$species_code <- droplevels(data$species_code)
species_code <-data %>% group_by(species_code) %>%
summarise(total=n()/nrow(data)) %>%
arrange(desc(total))
DT::datatable(species_code) %>%
formatPercentage('total',2)
species_code %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(species_code, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="Species Code") 
3.3.6 taxa
It represents the USFWS-derived broad taxonomic categorization
- There are 13 taxa levels
- There is a ‘taxa’ field for the vast majority (>99%) of records
- Most of them are mammals (29.16%), followed by shells (20.34%), reptiles (13.29%) and corals (12.63%). These four categories represent 75.42% of data
# Dropping unused levels
data$taxa <- droplevels(data$taxa)
taxa <-data %>% group_by(taxa) %>%
summarise(total=n()/nrow(data)*100) %>%
arrange(desc(total))
ggplot(data=taxa, aes(x=reorder(taxa, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="taxa") 
3.3.7 class
It represents the EHA-derived class-level taxonomic designation
- There are 53 classes
- The top 15 represents 93% of data
# Dropping unused levels
data$class <- droplevels(data$class)
class <-data %>% group_by(class) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(class) %>%
arrange(desc(total))
DT::datatable(class) %>%
formatPercentage('total',2)
class %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(class, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="class") 
3.3.8 genus
It represents the Genus (or higher-level taxonomic name) of the wildlife product
- There are 6,335 genus levels
- The top 15 represents 33.4% of data
- The top 50 represents 57.3% of data
# Dropping unused levels
data$genus <- droplevels(data$genus)
genus <-data %>% group_by(genus) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(genus) %>%
arrange(desc(total))
DT::datatable(genus) %>%
formatPercentage('total',2)
genus %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(genus, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="genus") 
3.3.9 species
It represents the species of the wildlife product. To get a more illustrative name, we combine genus and species.
- After removing generics, there are 8,082 different species
- The top 15 represents 23.8% of data
- The top 50 represents 38.3% of data
# Dropping unused levels
data$species <- droplevels(data$species)
# Removing generics
species<- data %>%
filter(!species %in% c("maxima", "sp.", "in trop fish &", "(marine sp.)",
"(freshwater sp.)", "(including goldfish)")) %>%
drop_na(species)
# Combine genus + species
species <- species %>% mutate(species = as.factor(paste(genus, "-",species)))
species <-species %>% group_by(species) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(species) %>%
arrange(desc(total)) %>%
ungroup()
DT::datatable(species) %>%
formatPercentage('total',2)
species %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(species, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="species") 
3.3.10 subspecies
It represents subspecies of the wildlife product
- There are 439 subspecies levels
- The top 15 just represents 2.5% of data
- The top 50 just represents 2.6% of data
# Dropping unused levels
data$subspecies <- droplevels(data$subspecies)
subspecies <-data %>% group_by(subspecies) %>%
summarise(total=n()/nrow(data)) %>%
filter(subspecies!="other shipments") %>%
arrange(desc(total))
DT::datatable(subspecies) %>%
formatPercentage('total',2)
subspecies %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(subspecies, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="subspecies") 
3.3.11 generic_name
It represents a general common name for the wildlife product
- There are 1,987 generic names levels
- The top 15 represents 49.3% of data
- The top 50 represents 71.6% of data
# Dropping unused levels
data$generic_name <- droplevels(data$generic_name)
generic_name <-data %>% group_by(generic_name) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(generic_name) %>%
arrange(desc(total))
DT::datatable(generic_name) %>%
formatPercentage('total',2)
generic_name %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(generic_name, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="generic_name") 
3.3.12 specific_name
It represents a specific common name for the wildlife product. To get a more illustrative name, we combine generic_name and specific_name
- There are 6,669 specific names levels
- The top 15 represents 36.3% of data
- The top 50 represents 58.7% of data
# Dropping unused levels
data$specific_name <- droplevels(data$specific_name)
# Combine generic_name and specific_name
data <- data %>% mutate(specific_name = paste(generic_name, "-",specific_name))
specific_name <-data %>% group_by(specific_name) %>%
summarise(total=n()/nrow(data)) %>%
drop_na(specific_name) %>%
arrange(desc(total))
DT::datatable(specific_name) %>%
formatPercentage('total',2)
specific_name %>%
mutate(total=total*100) %>%
top_n(15, total) %>%
ggplot(aes(x=reorder(specific_name, -total), y=total)) +
geom_bar(stat="identity",fill="steelblue") +
geom_text(aes(label=round(total,2)))+
theme(axis.text.x = element_text(angle=60, hjust=1)) +
labs(y = "Percentage", x="specific_name") 
3.4 More data cleaning after exploratory analysis
Let’s exclude from the analysis those descriptions with fewer than 10 instances. This reduces the number of descriptions from 88 to 78
data<-data %>%
group_by(description) %>%
filter(n()>=10) %>% # 5,451,832 -> 5,451,800 rows
ungroup() # drop the grouping so later steps operate on the whole data
# Dropping unused levels
data$description <- droplevels(data$description)
Let’s exclude from the analysis those species with fewer than 10 instances. This reduces the number of species from 8,088 to 6,426
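The code for the species filter is not shown here. A minimal sketch, assuming it mirrors the description filter above, is given below on an invented toy table. The threshold of 10 comes from the text; the object names and the toy species counts are hypothetical.

```r
library(dplyr)

# Toy table: two species with at least 10 rows and one rare species
toy <- tibble(
  species = factor(c(rep("Panthera leo", 12),
                     rep("Tridacna maxima", 10),
                     rep("Rare sp", 3)))
)

min_rows <- 10  # threshold taken from the text

filtered <- toy %>%
  group_by(species) %>%
  filter(n() >= min_rows) %>%             # keep species with at least 10 rows
  ungroup() %>%                           # avoid carrying the grouping forward
  mutate(species = droplevels(species))   # drop the levels that no longer occur

nlevels(filtered$species)
```

Calling `ungroup()` before continuing keeps later summaries from being computed per species by accident.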